Demo notebook for Niimpy Exploration layer modules

Introduction

To study and quantify human behavior using longitudinal multimodal digital data, it is essential to get to know the data well first. These data come from various sources and sensors, such as smartphones, smartwatches, and activity trackers, and have different types and properties. The data may be a mixture of categorical, ordinal, and numerical variables, typically consisting of time series measured for multiple subjects from different groups. While the data are typically dense, they are also heterogeneous and contain many missing values. Therefore, the analysis has to be conducted on many different levels.

This notebook introduces the Niimpy toolbox exploration module, which seeks to address the aforementioned issues. The module provides functionality for exploratory data analysis (EDA) of digital behavioral data. It aims to summarize the data characteristics, inspect the structures underlying the data, detect patterns and changes in those patterns, and assess data quality (e.g., missing data and outliers). This information is essential for assessing data validity, for data filtering and selection, and for data preprocessing. The module includes functions for plotting categorical data, data counts, time series lineplots, and punchcards, and for visualizing missing data.

Exploration module functions are intended to be run after data preprocessing, but they can also be run on raw observations. All functions are implemented using the Plotly Python Open Source Library. Plotly enables interactive visualizations, which makes it easier to explore different aspects of the data (e.g., a specific time range or summary statistics).

This notebook uses several sample dataframes to demonstrate the module. The sample data is either already preprocessed or will be preprocessed in the notebook sections before visualization. When the sample data is loaded, some of its key characteristics are displayed.

All exploration module functions require the data to follow the data schema defined in the Niimpy toolbox documentation. The user must ensure that the input data follows the specified schema.



Sub-module overview

The following table shows the accepted data types, the visualization functions, and the purpose of each exploration sub-module.

Sub-module        Data type                 Functions           Purpose
Categorical plot  Categorical               Barplot             Observation counts and distributions
Count plot        Categorical* / Numerical  Barplot / Boxplot   Observation counts and distributions
Lineplot          Numerical                 Lineplot            Trend, cyclicity, patterns
Punchcard         Categorical* / Numerical  Heatmap             Temporal patterns of counts or values
Missingness       Categorical / Numerical   Barplot / Heatmap   Missing data patterns

Data types denoted with * are not compatible with every function within the module.


NOTES

This notebook uses the following definitions when referring to the data:


Imports

Here we import modules needed for running this notebook.

Plotly settings

Next code block defines default settings for plotly visualizations. Feel free to adjust the settings according to your needs.

Table of Contents

1) Categorical plot

This section introduces the Categorical plot module, which visualizes categorical data, such as questionnaire responses.
We will demonstrate the functions using a mock survey dataframe, containing answers for:

The data will be preprocessed, and then its basic characteristics will be summarized before the visualizations.

The dataframe's columns are the raw questions from a survey. Some questions belong to a specific category, so we will annotate them with ids.
Each id is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI, etc.) followed by the question number (1, 2, 3). Similarly, we will also map the answers to meaningful numerical values.

Note: It's important that the dataframe follows the schema below before it is passed to Niimpy.

We can use a helper method to convert the answers into numerical values. The pre-defined mapper inside survey.py is useful for this step.
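The underlying idea can be sketched with a plain pandas mapping; the answer dictionary below is an illustrative stand-in, not the actual mapper shipped in survey.py:

```python
import pandas as pd

# Hypothetical PHQ-style answer mapping (illustrative values only).
ANSWER_MAP = {
    "Not at all": 0,
    "Several days": 1,
    "More than half the days": 2,
    "Nearly every day": 3,
}

answers = pd.DataFrame({"PHQ2_1": ["Not at all", "Nearly every day", "Several days"]})

# Map the free-text answers to their numerical scores.
numeric = answers["PHQ2_1"].map(ANSWER_MAP)
print(numeric.tolist())  # [0, 3, 1]
```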

We can also produce a summary of the questionnaire's scores. This function can describe the aggregated score over the whole population or over specific subgroups.

First we'll show statistics for the whole population:

Statistics by the group gender:

And finally statistics for PHQ questionnaires by group:
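Under the hood, such summaries amount to standard pandas aggregations; a minimal sketch with mock scores (the column names are illustrative):

```python
import pandas as pd

# Mock questionnaire scores for four subjects.
df = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4"],
    "gender": ["f", "m", "f", "m"],
    "PHQ2_score": [2, 5, 1, 4],
})

# Whole-population statistics.
print(df["PHQ2_score"].describe())

# Statistics by the group `gender`.
by_gender = df.groupby("gender")["PHQ2_score"].agg(["mean", "std", "min", "max"])
print(by_gender)
```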

1.1. Questionnaire summary

We can now make some plots for the preprocessed dataframe. First, we can display the summary for a specific question (the first PHQ-2 question).

The figure shows that the answer values (from 0 to 3) are almost uniformly distributed.

1.2. Questionnaire grouped summary

We can also display the summary for each subgroup (gender).

The figure shows that the differences between the subgroups are not substantial.

1.3. Questionnaire grouped summary score distribution

With some quick preprocessing, we can display the score distribution of each questionnaire.

We'll extract PSS-10 questionnaire answers from the dataframe.

We'll quickly inspect the preprocessed dataframe.

And then visualize grouped summary score distribution.

The figure shows that the grouped summary score distributions are close to each other.

2) Countplot

This section introduces the Countplot module. The module contains functions for visualizing user- and group-level observation counts (the number of data points per user or group) and observation value distributions.
Observation counts use barplots for user-level and boxplots for group-level visualizations. Boxplots are also used for group-level value distributions.
The module assumes that the visualized data is numerical.

Data

We will use a sample from the StudentLife dataset to demonstrate the module functions. The sample contains hourly aggregated activity data (values from 0 to 5) and group information based on pre- and post-study PHQ-9 test scores. Study subjects have been grouped by depression symptom severity into the groups: none, mild, moderate, moderately severe, and severe. The preprocessed data sample is included in the Niimpy toolbox sampledata folder.

Before visualizations, we'll inspect the data.

2.1. User level observation count

First, we visualize the number of observations for each subject.
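The quantity the barplot displays is a per-user observation count, which can be sketched with pandas (the user ids and values below are mock data, not the StudentLife sample):

```python
import pandas as pd

# Mock activity observations for three users.
df = pd.DataFrame({"user": ["u00", "u00", "u01", "u01", "u01", "u24"],
                   "activity": [1, 0, 2, 3, 1, 0]})

# Number of data points per user - the heights of the bars.
counts = df.groupby("user")["activity"].size()
print(counts.to_dict())  # {'u00': 2, 'u01': 3, 'u24': 1}
```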

The barplot shows that there are differences in the users' total activity counts. User u24 has the lowest event count (710), and users u02 and u59 have the highest (1584).

2.2. Group level observation count

Next we'll inspect group-level activity event counts aggregated by day. For improved clarity, we select a time range of one week from the data.
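The daily per-group counts behind the boxplot can be sketched as a grouped aggregation over a restricted time range (mock timestamps and groups, assuming a datetime-indexed dataframe):

```python
import pandas as pd

# Two days of mock hourly observations with a group label.
idx = pd.date_range("2013-03-28", periods=48, freq="h")
df = pd.DataFrame({"group": ["mild"] * 24 + ["severe"] * 24,
                   "activity": range(48)}, index=idx)

# Restrict to one week, then count events per group per day.
week = df.loc["2013-03-28":"2013-04-03"]
daily_counts = week.groupby(["group", pd.Grouper(freq="D")])["activity"].size()
print(daily_counts)
```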

The boxplot shows some variability in the group-level event count distributions across the days from Mar 28 to Apr 3, 2013.

2.3. Group level value distributions

Finally we visualize group level activity value distributions.

The boxplot shows that the activity score distributions for the groups mild and moderately severe differ from the rest.

3) Lineplot

This section introduces Lineplot module functions. We use the same StudentLife dataset derived activity data as in previous section.

3.1. Lineplot

Lineplot functions display numerical feature values on a time axis. The user can optionally resample (downsample) and smooth the data for better visual clarity.

3.1.1. Single user single feature

First, we'll visualize single-user, single-feature data without resampling or smoothing.

The figure showing all the activity datapoints is difficult to interpret. By zooming in on the time range, the daily patterns become apparent. There is no or low activity during the night.

3.1.2. Single user single feature, index reset

Next, we'll visualize the same data using resampling by hour and 24-hour rolling-window smoothing for improved clarity. We also reset the index, so that the x-axis now shows hours from the first activity feature observation.
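The resample-then-smooth step can be sketched directly with pandas (mock minute-level data, not the StudentLife sample):

```python
import numpy as np
import pandas as pd

# Three days of mock minute-level activity for one user.
idx = pd.date_range("2013-05-01", periods=3 * 24 * 60, freq="min")
ts = pd.Series(np.arange(len(idx)) % 5, index=idx, name="activity")

# Downsample to hourly means, then smooth with a 24-hour rolling window.
hourly = ts.resample("h").mean()
smoothed = hourly.rolling(window=24, min_periods=1).mean()

# Reset the index so the x-axis counts hours from the first observation.
smoothed = smoothed.reset_index(drop=True)
print(len(smoothed))  # 72 hourly points for 3 days
```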

By zooming in on the smoothed lineplot, daily activity patterns are easier to detect.

3.1.3. Single user single feature, aggregated by day

The next visualization shows resampling by day and 7-day rolling-window smoothing, making the trend of the activity time series visible.

Daily aggregation and smoothing make the user's activity trend visible. There is a peak on May 9 and a trough on May 23.

3.2. Multiple subjects single feature

The following visualization superimposes three subjects' activity on the same figure.

The figure shows that the users' daily averaged activity is quite similar at the beginning of the inspected time range. In the first two weeks of May, the activity shows opposing trends (user u00's activity increases while user u01's decreases).

3.3. Group level hourly averages

Next we'll compare group level hourly average activity.
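The hour-of-day averaging behind this comparison can be sketched with pandas (mock data; groups and values are illustrative):

```python
import pandas as pd

# Two days of mock hourly activity: one day per group.
idx = pd.date_range("2013-04-01", periods=48, freq="h")
df = pd.DataFrame({"group": ["mild"] * 24 + ["severe"] * 24,
                   "activity": [1] * 24 + [3] * 24}, index=idx)

# Average activity per group for each hour of the day (0-23).
hourly_avg = df.groupby(["group", df.index.hour])["activity"].mean()
print(hourly_avg.loc["mild"].head())
```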

The timeplot reveals that the hourly averaged group-level activity follows a circadian rhythm (less activity during the night). The moderately severe group seems to be the least active during the latter half of the day.

3.4. Group level weekday averages

And finally, we'll compare group-level weekday average activity.

The timeplot shows that there are some differences in the average group-level activity, e.g., the mild group being more active than the moderately severe group. Additionally, activity on Sundays is lower than on weekdays.

4) Punchcard

This section introduces the Punchcard module functions. The functions aggregate the data and show the averaged value for each time point.
We use the same StudentLife activity data as in the two previous sections.

4.1. Single user punchcard

First, we visualize daily aggregated mean activity for a single subject. We'll change the plot colors to grayscale for improved clarity.
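The matrix a punchcard heatmap displays is essentially a user-by-day table of aggregated means, which can be sketched with pandas (mock hourly data, not the StudentLife sample):

```python
import pandas as pd

# Three days of mock hourly activity for one user.
idx = pd.date_range("2013-05-01", periods=72, freq="h")
df = pd.DataFrame({"user": "u00", "activity": [0, 1, 2] * 24}, index=idx)

# Daily mean activity per user - one row per user, one column per day.
daily = df.groupby(["user", pd.Grouper(freq="D")])["activity"].mean().unstack()
print(daily)
```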

The punchcard reveals that May 5th has the highest average activity, while May 18th, 20th, and 21st have the lowest.

4.2. Multiple user punchcard

Next, we'll visualize mean activity for multiple subjects.

The punchcard allows comparison of daily average activity across multiple subjects. There seems to be no evident common pattern in the activity.

4.3. Single user punchcard showing two features

Lastly, we'll visualize daily aggregated single-user activity side by side with the previous week's activity.
We start by shifting the activity by one week and adding it to the original dataframe.
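The shift step can be sketched with pandas (mock daily data; the new column name is an illustrative choice):

```python
import pandas as pd

# Two weeks of mock daily activity.
idx = pd.date_range("2013-05-01", periods=14, freq="D")
df = pd.DataFrame({"activity": range(14)}, index=idx)

# Shift the series forward by 7 rows (= 7 days on a daily index) and
# store it as a second column holding the previous week's values.
df["activity_prev_week"] = df["activity"].shift(7)
print(df.loc["2013-05-08"])  # activity 7, previous week's activity 0.0
```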

The punchcard shows weekly repeating patterns in the subject's activity.

5) Missingness

This section introduces the Missingness module for inspecting missing data. The module provides visualizations of data missingness by frequency and by timepoint.
Additionally, it offers an option to visualize missing data correlations.

Data

For the data missingness visualizations, we'll create a mock dataframe with missing values using the niimpy.util.create_missing_dataframe function.
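A stand-in for such a mock dataframe can be built with numpy and pandas alone (this is not the create_missing_dataframe implementation, just an illustrative equivalent):

```python
import numpy as np
import pandas as pd

# Random data for five mock users.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=[f"User_{i}" for i in range(1, 6)])

# Knock out roughly 20% of the cells at random.
mask = rng.random(df.shape) < 0.2
df = df.mask(mask)
print(df.isna().mean().round(2))  # per-column missing fraction
```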

We will quickly inspect the dataframe before the visualizations.

5.1. Data frequency by feature

First, we create a histogram to visualize data frequency per column.

The data frequency is nearly the same for each user, with User_5 having the highest frequency.

5.2. Average frequency by user

Next, we will show average data frequency for all users.

The overall data frequency suggests no clear pattern for data missingness.

5.3. Missingness matrix

We can also create a missingness matrix visualization for the dataframe. The nullity matrix shows data missingness by timepoint.

5.4. Missing data correlations

Finally, we plot a heatmap to display the correlations between missing data.

Correlation ranges from -1 to 1:

Data

For the correlations, we use NYC collision factors sample data.

First, we'll inspect the data frame.

We will then inspect the basic statistics.

Finally, we will visualize the nullity (how strongly the presence or absence of one variable affects the presence of another) correlations by a heatmap and a dendrogram.
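The nullity correlation itself can be sketched with pandas by correlating the missingness indicator of each column (mock data; the column names only loosely echo the NYC collision sample):

```python
import numpy as np
import pandas as pd

# Two columns engineered to be missing together, one always complete.
df = pd.DataFrame({
    "vehicle_type_code": [1.0, np.nan, 3.0, np.nan],
    "contributing_factor": [0.5, np.nan, 0.7, np.nan],
    "borough_id": [1.0, 2.0, 3.0, 4.0],
})

# Correlate the 0/1 missingness indicators; identical missingness
# patterns yield a nullity correlation of 1. Columns with complete
# data have zero variance in their indicator, so their entries are NaN.
nullity_corr = df.isna().astype(float).corr()
print(nullity_corr.loc["vehicle_type_code", "contributing_factor"])
```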

The nullity heatmap and dendrogram reveal the missing-data correlation structure, e.g., the vehicle type code and contributing factor vehicle columns are highly correlated. Features with complete data are not shown in the figure.